2 Input files

4 Computing the comparison statistics

4.1 Negative control

For each method compute number of identified regions (which are all false-positives) and their total length.

Two methods have correctly found no DMRs - defiant and dmrseq.

4.2 Simulated data

4.2.1 Helper function

Helper function which returns length of overlap of one range (gr1) with any of ranges in list gr2.

4.2.2 Ways of computing TPs and FPs

Three different ways of computing true positive, false positive and false negative DMRs were used: 1. Any overlap: A detected DMR is considered to be true positive (\(TP_d\)) if it has any overlap with any simulated DMR. A detected DMR is considered to be false positive (\(FP_d\)) if it has no overlap with any simulated DMR. A simulated DMR is considered correctly identified (\(TP_s\)) if it has any overlap with any detected DMR.

  1. Threshold: A detected DMR is classified as \(TP_d\) if it is overlapped by at least \(x\) % of its length by one or more simulated regions, \(x\) is a selected threshold (80 % in our case). Otherwise it is classified as \(FP_d\). Similarly, a simulated DMR is classified as \(TP_s\) it is covered by at least \(x\) % by one or more detected DMRs.

  2. Proportional: If a detected DMR is covered by \(x*100\) % of its length by a simulated DMR (and by $(1-x)*100 $ % it is not covered by any simulated DMR), then it adds \(x\) to \(TP_d\) and \(1-x\) to \(FP_d\). Similarly, if a simulated DMR is detected from \(x*100\) %, then it adds \(x\) to \(TP_s\).

Observed false discovery rate (FDR) is then computed as \(FDR = FP_d / (FP_d + TP_d)\), power is computed as \(p = TP_s / 100\) (because there is 100 simulated DMRs in total).

For comparison and for verifying the results (see below) we have computed the FDR and power also based on individual nucleotides (being aware that this is not really correct for DMRs):

  1. Nucleotidewise: Every nucleotide position (for simplicity we have used all positions, not just C’s) is classified as:
  • TP if it is inside a simulated DMR and inside a detected DMR
  • FP if it is not inside a simulated DMR and is inside a detected DMR
  • TN if it is neither inside a simulated nor a detected DMR
  • FN if it is inside a simulated DMR but not inside a detected DMR

Observed FDR is then computed as \(FDR = FP / (FP + TP)\), power is computed as \(p = TP/(TP+FN)\).

4.2.3 Computation

Threshold for required overlap of regions in the threshold method:

For every method which provides q-value estimate we compute number of TPs and FPs, FDR, and power as described above.

# for every method
for (j in 1:length(sim.new[[1]])) {
  n <- names(sim.new[[1]])[j]
  data.tmp.threshold <- data.frame(spec.fdr = numeric(),
                         tp.found.threshold = numeric(),
                         fp.found.threshold = numeric(),
                         tp.sim.threshold = numeric(),
                         num.regions = numeric(),
                         obs.fdr = numeric(),
                         power = numeric(),
                         total.length = numeric(),
                         method = character(),
                         stringsAsFactors = F)
  data.tmp.all <- data.frame(spec.fdr = numeric(),
                         tp.found.all = numeric(),
                         fp.found.all = numeric(),
                         tp.sim.all = numeric(),
                         num.regions = numeric(),
                         obs.fdr = numeric(),
                         power = numeric(),
                         total.length = numeric(),
                         method = character(),
                         stringsAsFactors = F)
  data.tmp.prop <- data.frame(spec.fdr = numeric(),
                         tp.found.prop = numeric(),
                         fp.found.prop = numeric(),
                         tp.sim.prop = numeric(),
                         num.regions = numeric(),
                         obs.fdr = numeric(),
                         power = numeric(),
                         total.length = numeric(),
                         method = character(),
                         stringsAsFactors = F)
  data.tmp.nuc <- data.frame(spec.fdr = numeric(),
                         tp.found.nuc = numeric(),
                         fp.found.nuc = numeric(),
                         tp.sim.nuc = numeric(),
                         num.regions = numeric(),
                         obs.fdr = numeric(),
                         power = numeric(),
                         total.length = numeric(),
                         method = character(),
                         stringsAsFactors = F)
  
  # for every FDR cut-off
  for (i in 1:length(cutoffs)) {
    fdr <- cutoffs[[i]]
    # number of detected regions
    num.regions <- length(ranges(sim.new[[i]][[j]]))
    # total length of detected regions
    tot.length <- sum(width(sim.new[[i]][[j]]))
    
    if (num.regions > 0) {
      # lengths of overlaps with simulated (true) DMRs
      overlaps.foundToSim <- unlist(lapply(c(1:num.regions), function(x) meta_GR_overlap(ranges(sim.new[[i]][[j]][x]), ranges(read.dmr))))
      proportions.foundToSim <- overlaps.foundToSim/width(sim.new[[i]][[j]])
      
      # any overlap
      tp.found.all <- sum(overlaps.foundToSim>0)
      fp.found.all <- num.regions - tp.found.all
      
      # overlap at least threshold
      tp.found.threshold <- sum(proportions.foundToSim >=threshold)
      fp.found.threshold <- num.regions - tp.found.threshold
      
      # proportion
      tp.found.prop <- sum(proportions.foundToSim)
      fp.found.prop <- num.regions - tp.found.prop
      
      # number of correctly classified nucleotides
      tp.found.nuc <- sum(overlaps.foundToSim)
      fp.found.nuc <- tot.length - tp.found.nuc
      
      # lengths of overlaps of simulated regions with the detected ones
      overlaps.simToFound <- unlist(lapply(c(1:length(read.dmr)), function(x) meta_GR_overlap(ranges(read.dmr)[x],  ranges(sim.new[[i]][[j]]))))
      proportions.simToFound <- overlaps.simToFound/width(read.dmr)
      
      # any overlap
      tp.sim.all <- sum(overlaps.simToFound > 0)
      
      # overlap at least threshold
      tp.sim.threshold <- sum(proportions.simToFound >= threshold)
      
      # proportion
      tp.sim.prop <- sum(proportions.simToFound)
      
      # nucleotide-wise
      tp.sim.nuc <- sum(overlaps.simToFound)
      
      # save the data
      data.tmp.threshold[i,] <- c(fdr, tp.found.threshold, fp.found.threshold, tp.sim.threshold, num.regions, fp.found.threshold/num.regions, tp.sim.threshold/100, tot.length, n)
      data.tmp.all[i,] <- c(fdr, tp.found.all, fp.found.all, tp.sim.all, num.regions, fp.found.all/num.regions, tp.sim.all/100, tot.length, n)
      data.tmp.prop[i,] <- c(fdr, tp.found.prop, fp.found.prop, tp.sim.prop, num.regions, fp.found.prop/num.regions, tp.sim.prop/100, tot.length, n)
      data.tmp.nuc[i,] <- c(fdr, tp.found.nuc, fp.found.nuc, tp.sim.nuc, tot.length, fp.found.nuc/tot.length, tp.sim.nuc/tot.length.sim, tot.length, n)
      
    } else {
      # no regions identified
      data.tmp.threshold[i,] <- c(fdr, 0, 0, 0, 0, 0, 0, 0, n)
      data.tmp.all[i,] <- c(fdr, 0, 0, 0, 0, 0, 0, 0, n)
      data.tmp.prop[i,] <- c(fdr, 0, 0, 0, 0, 0, 0, 0, n)
      data.tmp.nuc[i,] <- c(fdr, 0, 0, 0, 0, 0, 0, 0, n)
    }
  }
  
  stats.each.threshold[[j]] <- data.tmp.threshold
  names(stats.each.threshold)[j] <- n
  stats.each.all[[j]] <- data.tmp.all
  names(stats.each.all)[j] <- n
  stats.each.prop[[j]] <- data.tmp.prop
  names(stats.each.prop)[j] <- n
  stats.each.nuc[[j]] <- data.tmp.nuc
  names(stats.each.nuc)[j] <- n
}

For the methods which do not provide q-value estimation, only the observed FDR and power are computed:

for (i in 1:length(read.sim.no.qval)) {
  n <- names(read.sim.no.qval)[i]
  # number of identified regions
  num.regions <- length(ranges(read.sim.no.qval[[i]]))
  # total length of identified regions
  tot.length <- sum(width(read.sim.no.qval[[i]]))
  
  data.tmp.threshold <- data.frame(spec.fdr = numeric(),
                         tp.found.threshold = numeric(),
                         fp.found.threshold = numeric(),
                         tp.sim.threshold = numeric(),
                         num.regions = numeric(),
                         obs.fdr = numeric(),
                         power = numeric(),
                         total.length = numeric(),
                         method = character(),
                         stringsAsFactors = F)
  data.tmp.all <- data.frame(spec.fdr = numeric(),
                         tp.found.all = numeric(),
                         fp.found.all = numeric(),
                         tp.sim.all = numeric(),
                         num.regions = numeric(),
                         obs.fdr = numeric(),
                         power = numeric(),
                         total.length = numeric(),
                         method = character(),
                         stringsAsFactors = F)
  data.tmp.prop <- data.frame(spec.fdr = numeric(),
                         tp.found.prop = numeric(),
                         fp.found.prop = numeric(),
                         tp.sim.prop = numeric(),
                         num.regions = numeric(),
                         obs.fdr = numeric(),
                         power = numeric(),
                         total.length = numeric(),
                         method = character(),
                         stringsAsFactors = F)
  data.tmp.nuc <- data.frame(spec.fdr = numeric(),
                         tp.found.nuc = numeric(),
                         fp.found.nuc = numeric(),
                         tp.sim.nuc = numeric(),
                         num.regions = numeric(),
                         obs.fdr = numeric(),
                         power = numeric(),
                         total.length = numeric(),
                         method = character(),
                         stringsAsFactors = F)
  
  # lengths of overlaps with the simulated (= true) regions
  overlaps.foundToSim <- unlist(lapply(c(1:num.regions), function(x) meta_GR_overlap(ranges(read.sim.no.qval[[i]][x]), ranges(read.dmr))))
  proportions.foundToSim <- overlaps.foundToSim/width(read.sim.no.qval[[i]])
  
  # any overlap    
  tp.found.all <- sum(overlaps.foundToSim>0)
  fp.found.all <- num.regions - tp.found.all
  
  # at least threshold    
  tp.found.threshold <- sum(proportions.foundToSim >=threshold)
  fp.found.threshold <- num.regions - tp.found.threshold
    
  # proportional  
  tp.found.prop <- sum(proportions.foundToSim)
  fp.found.prop <- num.regions - tp.found.prop
  
  # nucleotide-wise
  tp.found.nuc <- sum(overlaps.foundToSim)
  fp.found.nuc <- tot.length - tp.found.nuc
  

  # lengths of overlaps of the simulated regions with detected ones
  overlaps.simToFound <- unlist(lapply(c(1:length(read.dmr)), function(x) meta_GR_overlap(ranges(read.dmr)[x],  ranges(read.sim.no.qval[[i]]))))
  proportions.simToFound <- overlaps.simToFound/width(read.dmr)
     
  # any overlap 
  tp.sim.all <- sum(overlaps.simToFound > 0)
     
  # at least threshold 
  tp.sim.threshold <- sum(proportions.simToFound >= threshold)
    
  # proportional  
  tp.sim.prop <- sum(proportions.simToFound)
  
  # nucleotide-wise
  tp.sim.nuc <- sum(overlaps.simToFound)
  
  
  # save the data
  data.tmp.threshold[1,] <- c(NaN, tp.found.threshold, fp.found.threshold, tp.sim.threshold, num.regions, fp.found.threshold/num.regions, tp.sim.threshold/100, tot.length, n)
  data.tmp.all[1,] <- c(NaN, tp.found.all, fp.found.all, tp.sim.all, num.regions, fp.found.all/num.regions, tp.sim.all/100, tot.length, n)
  data.tmp.prop[1,] <- c(NaN, tp.found.prop, fp.found.prop, tp.sim.prop, num.regions, fp.found.prop/num.regions, tp.sim.prop/100, tot.length, n)
  data.tmp.nuc[1,] <- c(fdr, tp.found.nuc, fp.found.nuc, tp.sim.nuc, tot.length, fp.found.nuc/tot.length, tp.sim.nuc/tot.length.sim, tot.length, n)
  
  stats.each.threshold[[length(sim.new[[1]])+i]] <- data.tmp.threshold
  stats.each.all[[length(sim.new[[1]])+i]] <- data.tmp.all
  stats.each.prop[[length(sim.new[[1]])+i]] <- data.tmp.prop
  stats.each.nuc[[length(sim.new[[1]])+i]] <- data.tmp.nuc
}

Concatenate the data frames:

5 Plots

5.1 Negative control

5.1.1 Number of identified regions

5.1.2 Total length of identified regions

5.2 Simulated data

5.2.1 Observed FDR vs. power

The observed FDR vs. observed power for different cut-offs.

Since the relative ordering of the methods stays the same, the used method of classifying regions as TPs, FPs, etc. does not seem to play a big role. Thus, from now on we use only the data with any overlap.

5.2.2 Specified FDR vs. observed FDR

How well do the methods control FDR?

5.2.3 FP vs. TP

Number of false positives – detected regions that have no overlap with a real DMR – vs. number of true positives – real DMRs which were discovered.

5.2.4 Average length of reported regions

Average length of reported region for FDR cut-off 0.05 (and average length of the regions reported by BSseq and MethPipe), compared to average length of the true DMRs.

The average length of ranges reported by dmrseq is the highest from all the methods (although it corresponds quite well with the average length of true DMRs). Isn’t dmrseq good only because it reports longer intervals (that by chance overlap some true interval)? Let’s check by looking at the individual nucleotides.

5.2.5 FDR vs. power - nucleotide-wise

6 SessionInfo

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.1 (2018-07-02)
##  os       Ubuntu 16.04.5 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Zurich               
##  date     2019-01-10                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package          * version   date       lib source        
##  assertthat         0.2.0     2017-04-11 [1] CRAN (R 3.5.1)
##  backports          1.1.3     2018-12-14 [1] CRAN (R 3.5.1)
##  bindr              0.1.1     2018-03-13 [1] CRAN (R 3.5.1)
##  bindrcpp         * 0.2.2     2018-03-29 [1] CRAN (R 3.5.1)
##  BiocGenerics     * 0.26.0    2018-10-04 [1] Bioconductor  
##  bitops             1.0-6     2013-08-17 [1] CRAN (R 3.5.1)
##  Cairo              1.5-9     2015-09-26 [1] CRAN (R 3.5.1)
##  callr              3.1.1     2018-12-21 [1] CRAN (R 3.5.1)
##  cli                1.0.1     2018-09-25 [1] CRAN (R 3.5.1)
##  colorspace         1.3-2     2016-12-14 [1] CRAN (R 3.5.1)
##  crayon             1.3.4     2017-09-16 [1] CRAN (R 3.5.1)
##  crosstalk          1.0.0     2016-12-21 [1] CRAN (R 3.5.1)
##  data.table       * 1.11.8    2018-09-30 [1] CRAN (R 3.5.1)
##  desc               1.2.0     2018-05-01 [1] CRAN (R 3.5.1)
##  devtools           2.0.1     2018-10-26 [1] CRAN (R 3.5.1)
##  digest             0.6.18    2018-10-10 [1] CRAN (R 3.5.1)
##  dplyr              0.7.8     2018-11-10 [1] CRAN (R 3.5.1)
##  DT               * 0.5       2018-11-05 [1] CRAN (R 3.5.1)
##  evaluate           0.12      2018-10-09 [1] CRAN (R 3.5.1)
##  fs                 1.2.6     2018-08-23 [1] CRAN (R 3.5.1)
##  GenomeInfoDb     * 1.16.0    2018-10-04 [1] Bioconductor  
##  GenomeInfoDbData   1.1.0     2018-10-04 [1] Bioconductor  
##  GenomicRanges    * 1.32.7    2018-09-20 [1] Bioconductor  
##  ggplot2          * 3.1.0     2018-10-25 [1] CRAN (R 3.5.1)
##  glue               1.3.0     2018-07-17 [1] CRAN (R 3.5.1)
##  gtable             0.2.0     2016-02-26 [1] CRAN (R 3.5.1)
##  htmltools          0.3.6     2017-04-28 [1] CRAN (R 3.5.1)
##  htmlwidgets        1.3       2018-09-30 [1] CRAN (R 3.5.1)
##  httpuv             1.4.5.1   2018-12-18 [1] CRAN (R 3.5.1)
##  httr               1.4.0     2018-12-11 [1] CRAN (R 3.5.1)
##  IRanges          * 2.14.12   2018-09-20 [1] Bioconductor  
##  jsonlite           1.6       2018-12-07 [1] CRAN (R 3.5.1)
##  knitr              1.21      2018-12-10 [1] CRAN (R 3.5.1)
##  labeling           0.3       2014-08-23 [1] CRAN (R 3.5.1)
##  later              0.7.5     2018-09-18 [1] CRAN (R 3.5.1)
##  lazyeval           0.2.1     2017-10-29 [1] CRAN (R 3.5.1)
##  magrittr           1.5       2014-11-22 [1] CRAN (R 3.5.1)
##  memoise            1.1.0     2017-04-21 [1] CRAN (R 3.5.1)
##  mime               0.6       2018-10-05 [1] CRAN (R 3.5.1)
##  munsell            0.5.0     2018-06-12 [1] CRAN (R 3.5.1)
##  pillar             1.3.1     2018-12-15 [1] CRAN (R 3.5.1)
##  pkgbuild           1.0.2     2018-10-16 [1] CRAN (R 3.5.1)
##  pkgconfig          2.0.2     2018-08-16 [1] CRAN (R 3.5.1)
##  pkgload            1.0.2     2018-10-29 [1] CRAN (R 3.5.1)
##  plotly           * 4.8.0     2018-07-20 [1] CRAN (R 3.5.1)
##  plyr               1.8.4     2016-06-08 [1] CRAN (R 3.5.1)
##  prettyunits        1.0.2     2015-07-13 [1] CRAN (R 3.5.1)
##  processx           3.2.1     2018-12-05 [1] CRAN (R 3.5.1)
##  promises           1.0.1     2018-04-13 [1] CRAN (R 3.5.1)
##  ps                 1.3.0     2018-12-21 [1] CRAN (R 3.5.1)
##  purrr              0.2.5     2018-05-29 [1] CRAN (R 3.5.1)
##  R6                 2.3.0     2018-10-04 [1] CRAN (R 3.5.1)
##  Rcpp               1.0.0     2018-11-07 [1] CRAN (R 3.5.1)
##  RCurl              1.95-4.11 2018-07-15 [1] CRAN (R 3.5.1)
##  remotes            2.0.2     2018-10-30 [1] CRAN (R 3.5.1)
##  rlang              0.3.0.1   2018-10-25 [1] CRAN (R 3.5.1)
##  rmarkdown          1.11      2018-12-08 [1] CRAN (R 3.5.1)
##  rprojroot          1.3-2     2018-01-03 [1] CRAN (R 3.5.1)
##  S4Vectors        * 0.18.3    2018-10-04 [1] Bioconductor  
##  scales             1.0.0     2018-08-09 [1] CRAN (R 3.5.1)
##  sessioninfo        1.1.1     2018-11-05 [1] CRAN (R 3.5.1)
##  shiny              1.2.0     2018-11-02 [1] CRAN (R 3.5.1)
##  stringi            1.2.4     2018-07-20 [1] CRAN (R 3.5.1)
##  stringr            1.3.1     2018-05-10 [1] CRAN (R 3.5.1)
##  testthat           2.0.1     2018-10-13 [1] CRAN (R 3.5.1)
##  tibble             1.4.2     2018-01-22 [1] CRAN (R 3.5.1)
##  tidyr              0.8.2     2018-10-28 [1] CRAN (R 3.5.1)
##  tidyselect         0.2.5     2018-10-11 [1] CRAN (R 3.5.1)
##  usethis            1.4.0     2018-08-14 [1] CRAN (R 3.5.1)
##  viridisLite        0.3.0     2018-02-01 [1] CRAN (R 3.5.1)
##  withr              2.1.2     2018-03-15 [1] CRAN (R 3.5.1)
##  xfun               0.4       2018-10-23 [1] CRAN (R 3.5.1)
##  xtable             1.8-3     2018-08-29 [1] CRAN (R 3.5.1)
##  XVector            0.20.0    2018-10-04 [1] Bioconductor  
##  yaml               2.2.0     2018-07-25 [1] CRAN (R 3.5.1)
##  zlibbioc           1.26.0    2018-10-04 [1] Bioconductor  
## 
## [1] /home/ubuntu/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library